Environmental and Spatial Predictors of Bird Diversity across the United States

Spatial Analysis of BBS Data

Spatial Data
Birds
Python
Author

Matěj Tvarůžka

Published

February 13, 2026

1 Introduction

One of the basic questions in biology asks the following: What is the origin of biodiversity? Why are there so many distinct species at the same site? These questions can be examined through evolutionary and ecological perspectives, which are often connected. The ecological approach is crucial, because it links the evolutionary history of coexisting species with the environment in which they occur and examines the factors which allow not just the build-up of local (and consequently regional) species richness, but also the ways it is maintained over time. One of the factors that facilitates species coexistence in birds and therefore drives their species richness (number of species at the site) on the regional scale is primary productivity of the environment (Pigot, Tobias, and Jetz 2016).

At broad spatial scales, species richness typically increases with productivity (Hawkins, Porter, and Diniz-Filho 2003). In contrast, at finer spatial scales, the relationship between productivity and species richness is more complex: positive, negative, hump-shaped, and U-shaped relationships have all been documented (Mittelbach et al. 2001). There are several hypotheses for this relationship, but we still do not know the ultimate mechanism behind it (Di Cecco et al. 2022). We will look at how the productivity of vegetation drives species richness at the continental scale of North America. For this, I will use the Breeding Bird Survey (BBS) dataset, one of the most significant and longest-running “citizen science” programmes in the world. Established in 1966, its primary goal is to monitor the status and trends of North American bird populations on a continental scale. The census operates at the landscape scale: the routes are about 40 km long, with 50 stops at which volunteers count birds. As the measure of primary productivity and environmental heterogeneity of the vegetation, I decided to use the Normalised Difference Vegetation Index (NDVI), which is used as a proxy for primary productivity in both regional (Nieto, Flombaum, and Garbulsky 2015) and local studies (Brüggeshemke and Fartmann 2025).

Note: Research Questions
  • How do vegetation density and heterogeneity predict bird diversity across the United States?
  • How do spatial predictors explain the patterns of species richness in North American birds?
Note: Tasks to cover
  1. Prepare and explore the data - clean BBS routes, extract NDVI for route geometries, and get elevation data with Earth Engine
  2. Evaluate global and local spatial autocorrelation in species richness and model residuals
  3. Build the OLS model and compare it with the GWR model
View Code
import geopandas as gpd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import httpx
import rioxarray
import xvec
import folium
import ee
import geemap
import esda
import contextily
from libpysal import graph
import statsmodels.formula.api as smf
import mgwr
from mgwr.gwr import GWR
from mgwr.sel_bw import Sel_BW

2 1. Data preparation

2.1 Study area

I chose the lower 48 states of the US as the study area, because their route geometries are publicly available. In the basic BBS dataset, geometry is included only for the starting point of each route; since the routes are on average about 40 km long, a single point does not represent them well. First, we need to select the boundary of the study area.

View Code
us_states = gpd.read_file("study_area/cb_2018_us_state_500k.shp")
exclude = ["AK", "HI", "PR", "VI", "MP", "GU", "AS"] # States outside the area of interest
us_states = us_states.query("STUSPS not in @exclude") # Accessing the exclude variable outside the query() string
us_boundary = us_states.dissolve()

2.2 BBS data

Now it is time to load the species richness data, i.e. the number of species detected on each route. I decided to use data from 2018, which I had already prepared in R, computing species richness for each transect. These data have one geometry column with the coordinates of the starting point of each route. I downloaded the data from the official project site.

View Code
bbs = gpd.read_file("bbs/bbs_routes_SR_2018.gpkg", engine = "fiona")

bbs = bbs.loc[bbs["CountryNum"] == 840] # Select only US transects
bbs = bbs.loc[bbs["StateNum"] != 3] # Select every state except Alaska
bbs["RTENO"] = bbs["RTENO"].astype(str).str[3:] # Slice the string (skips the first 3 characters)

Let’s load the data for the routes. Unfortunately, the route geometries are often inaccurate and sometimes consist of several fragments, which can cause problems. Therefore, we first need to remove potentially wrong route geometries. The expected route geometry ought to be about 40 km long, so let us filter out outliers from this value. This cleaning will result in the loss of several hundred data points, but we need precise route geometries for data extraction.

View Code
routes = gpd.read_file("routes/bbsrtsl020.shp")

routes = routes.to_crs(epsg=5070) # First set the crs to USA Contiguous Albers Equal Area Conic projection
routes["calc_length_km"] = routes.geometry.length / 1000 # Calculating the route length for given geometry
routes = routes.loc[routes["RTELENG"] <= 45000,] # Selecting routes under 45 km in length by the dataset measure
mask = (routes["calc_length_km"] >= 35) & (routes["calc_length_km"] <= 45) # Filter out fragments and over-long routes
routes_filtered = routes[mask].copy()

print(f"Original routes count: {len(routes)}")
print(f"Filtered routes count (35-45km): {len(routes_filtered)}")
print(f"Average route length: {routes_filtered['calc_length_km'].mean():.2f} km") # Single quotes inside the f-string for compatibility with Python < 3.12
Original routes count: 3062
Filtered routes count (35-45km): 2570
Average route length: 40.14 km

Let us perform the spatial join of these two GeoDataFrames. To combine the BBS data with the route geometries, we use a spatial join with how="inner", as we need matching values in both tables. I could also join the tables by the column “RTENO”, but I want to be sure that the geometries match.

View Code
bbs = bbs.to_crs(routes_filtered.crs) # Make sure it's in the same coordinate system
bbs = bbs.drop(columns = "RTENO")

bbs["geometry"] = bbs.buffer(400) # Add a small buffer (400 m) to points to account for GPS inaccuracy

joined_data = gpd.sjoin(routes_filtered, bbs, how="inner", predicate="intersects") # Spatial join of bbs points data with the routes geometries

print(f"Routes in subset: {len(routes_filtered)}")
print(f"Routes with data: {len(joined_data['RTENO'].unique())}")
Routes in subset: 2570
Routes with data: 1611

The large drop in the number of routes is due to missing BBS data for the route geometries, which could be caused by annual differences in the census: in some years, particular routes are not surveyed due to weather conditions, terrain obstacles, or illness of the observer. However, the drop is still very large, so there are probably additional errors and inaccuracies in the geometry matching. Now we will do the final preparation of the table for extracting the NDVI and elevation values. We need to get rid of redundant columns, create a buffer around the routes, add the route midpoint, and create lat/lon columns for the external API.
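One way to diagnose where the matching fails would be to cross-check the spatial join against a plain attribute join on the route identifier. A minimal sketch with hypothetical RTENO values (not taken from the real data), using pandas' merge indicator to flag rows found in only one table:

```python
import pandas as pd

# Hypothetical route identifiers on each side of the join
routes_attr = pd.DataFrame({"RTENO": ["001", "002", "003"]})
bbs_attr = pd.DataFrame({"RTENO": ["001", "003", "004"], "species_richness": [63, 54, 61]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
check = routes_attr.merge(bbs_attr, on="RTENO", how="outer", indicator=True)
print(check["_merge"].value_counts())
```

Routes flagged "left_only" have a geometry but no matching BBS record; comparing that set with the routes lost in the spatial join would separate missing survey data from geometry mismatches.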

View Code
# Add .copy() here to break the link to joined_data
bbs_routes = joined_data[["RTENO", "species_richness", "geometry"]].copy()

bbs_routes["buff_3000"] = bbs_routes.buffer(3000)  # Creating 3 km buffer - buffer choice explained in NDVI extraction part
bbs_routes["route_midpoint"] = bbs_routes.geometry.interpolate(0.5, normalized=True) # Calculate the midpoint (50% along the line) as a reference geometry
bbs_routes = bbs_routes.drop(columns = "geometry") # We no longer need routes geometry

temp_gdf = bbs_routes.set_geometry("route_midpoint").to_crs(4326)
bbs_routes["lon"] = temp_gdf.geometry.x
bbs_routes["lat"] = temp_gdf.geometry.y
bbs_routes = bbs_routes.set_geometry("buff_3000")
bbs_routes.head()
RTENO species_richness buff_3000 route_midpoint lon lat
0 2072 63 POLYGON ((782767.664 1352804.909, 782613.87 13... POINT (799058.576 1358112.615) -87.165585 34.947661
2 2204 54 POLYGON ((911518.247 1350136.581, 911484.644 1... POINT (920348.359 1362929.663) -85.828469 34.882272
3 2073 65 POLYGON ((883221.105 1343004.45, 883424.384 13... POINT (896739.418 1353510.697) -86.098287 34.820791
4 2071 61 POLYGON ((766029.847 1327436.538, 765484.087 1... POINT (776778.876 1331128.227) -87.437364 34.725655
5 2007 63 POLYGON ((869910.569 1340151.781, 869882.814 1... POINT (879587.76 1341112.337) -86.300271 34.726327

2.3 NDVI raster

The normalised difference vegetation index (NDVI) indicates vegetation health within raster image pixels by quantifying how much near-infrared (NIR) light the plants reflect. Healthy vegetation reflects a higher proportion of NIR and a lower proportion of red light than stressed vegetation or non-living surfaces with similar visible colours (for example, turf fields). This contrast makes NDVI a practical tool for evaluating how healthy vegetation is in a raster image relative to its surroundings.

NDVI is calculated for each pixel with the following calculation:

\(\Large NDVI = \frac{(NIR - Red)}{(NIR + Red)}\)

This formula generates a value between -1 and +1. Low reflectance in the red channel and high reflectance in the NIR channel will yield a high NDVI value (healthy vegetation), while the inverse will result in a low NDVI value (unhealthy vegetation). Negative values typically represent non-vegetation such as water or rock.
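A quick numeric illustration of the formula; the reflectance values below are made up for three typical surface types:

```python
import numpy as np

# Hypothetical NIR and red reflectances for three pixels:
# dense forest, sparse grassland, open water
nir = np.array([0.50, 0.30, 0.02])
red = np.array([0.05, 0.15, 0.04])

ndvi = (nir - red) / (nir + red)  # element-wise NDVI per pixel
print(ndvi.round(2))  # → [ 0.82  0.33 -0.33]
```

Healthy forest lands near the top of the range, sparse vegetation in the middle, and water comes out negative, exactly as described above.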

The NDVI raster for the whole area would be very large and computationally demanding. Therefore, I decided to choose a different approach and work with a compressed 8-bit raster (values from 0 to 255) from NASA Earth Observations (NEO), which represents a 1-month average for June. The values contained in this file have been scaled and resampled to support visualization in NEO and are not suitable for rigorous scientific analysis. However, they may be used for basic assessments and for identifying general trends, which is exactly the goal of this study. We only need to resolve the different scaling of the values. We will extract the mean NDVI values for a 3 km buffer; this distance preserves higher predictive power for covariates in studies of breeding birds (Byer et al. 2025).
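A small sketch of what "resolving the scaling" means here. The linear mapping from the 8-bit range to [-1, 1] below is an assumption for illustration, not NEO's documented scaling; the practical point is that Z-score standardisation (used later in this analysis) is invariant to any increasing linear rescale, so the exact constants do not matter for the modelling:

```python
import numpy as np

vals = np.array([10.0, 128.0, 250.0])  # hypothetical 8-bit pixel values
ndvi_approx = vals / 255 * 2 - 1       # assumed linear map [0, 255] -> [-1, 1]

# Z-scores are identical whether computed on the raw bytes or the rescaled
# values, because standardisation removes any increasing linear transform
z_raw = (vals - vals.mean()) / vals.std()
z_scaled = (ndvi_approx - ndvi_approx.mean()) / ndvi_approx.std()
print(np.allclose(z_raw, z_scaled))  # → True
```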

View Code
ndvi = rioxarray.open_rasterio("MOD_NDVI_M_2018-06-01_rgb_3600x1800.TIFF")

ndvi_us = ndvi.rio.clip(us_boundary.to_crs(ndvi.rio.crs).geometry) # Matching CRS and clipping to study area
ndvi_us = ndvi_us.drop_vars("band").squeeze() # Getting rid of redundant dimension
ndvi_us = ndvi_us.where(ndvi_us>0) # Filtering null values

print(f"Raster CRS: {ndvi_us.rio.crs}") # Check the CRS to see if it's in degrees (4326) or meters
Raster CRS: EPSG:4326
View Code
_ = ndvi_us.plot()

We can see from the map that NDVI is much higher in the eastern US than in the West. Now we can extract values from the raster for the route buffers we created earlier. We will extract 1) the mean NDVI value for each buffer and 2) the standard deviation of NDVI for each buffer. The latter will help estimate the heterogeneity of the vegetation.

View Code
ndvi_us = ndvi_us.rio.reproject(bbs_routes.crs) # Reprojecting to EPSG: 5070
ndvi_extraction = bbs_routes.to_crs(ndvi_us.rio.crs)

zonal_stats = ndvi_us.drop_vars("spatial_ref").xvec.zonal_stats(
    geometry=ndvi_extraction.buff_3000,
    x_coords="x",
    y_coords="y",
    stats=["mean", "std"],
)

Let us check if there are some missing values in the data.

View Code
ndvi = zonal_stats.xvec.to_geodataframe(name="NDVI") # Transforming zonal statistics into a geodataframe
print(ndvi.isna().sum()) # Returns the count of NaNs for every column in the dataframe
geometry     0
index        0
NDVI        66
dtype: int64
View Code
ndvi_clean = ndvi.dropna(subset=["NDVI"]) # Dropping rows where the NDVI value is missing

ndvi_stats = ndvi_clean.reset_index().pivot(
    index="index", 
    columns="zonal_statistics", 
    values="NDVI"
) # This creates a table with index as rows and the mean/std statistics as columns

ndvi_stats.columns = ["ndvi_" + col for col in ndvi_stats.columns] # Add the "ndvi_" prefix to the new columns
geometries = ndvi_clean[["index", "geometry"]].drop_duplicates("index").set_index("index") # Get the unique geometries for each index
ndvi_data = geometries.join(ndvi_stats).reset_index() # Join them together

Although the NDVI data was retrieved in 8-bit format (0–255), we can standardise it to Z-scores prior to modelling. This process centres the data and removes the original units, ensuring that the coefficients in the models are not biased by the original bit-depth of the satellite imagery.

View Code
# Standardize the Mean (Relative Productivity)
# (Value - average of all routes) / standard deviation of all routes
ndvi_data["ndvi_mean_z"] = (ndvi_data["ndvi_mean"] - ndvi_data["ndvi_mean"].mean()) / ndvi_data["ndvi_mean"].std()

# Standardize the Std (Relative Heterogeneity)
# We treat the "std" as its own variable and standardize IT.
ndvi_data["ndvi_heterog_z"] = (ndvi_data["ndvi_std"] - ndvi_data["ndvi_std"].mean()) / ndvi_data["ndvi_std"].std()
View Code
ndvi_to_join = ndvi_data[["ndvi_mean_z", "ndvi_heterog_z", "geometry"]]

buffer = bbs_routes.set_geometry("buff_3000")

# Perform the spatial join
# predicate="intersects" requires the buffer and NDVI geometries to overlap
# how="inner" keeps only rows that exist in both sets
bbs_ndvi = gpd.sjoin(
    buffer, 
    ndvi_to_join, 
    how="inner", 
    predicate="intersects"
)

bbs_ndvi = bbs_ndvi.drop(columns="index_right") # Dropping index_right
data = bbs_ndvi.drop_duplicates(subset=["RTENO"]) # Drop duplicate routes
data.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 1592 entries, 0 to 3730
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   RTENO             1592 non-null   int64   
 1   species_richness  1592 non-null   int64   
 2   buff_3000         1592 non-null   geometry
 3   route_midpoint    1592 non-null   geometry
 4   lon               1592 non-null   float64 
 5   lat               1592 non-null   float64 
 6   ndvi_mean_z       1592 non-null   float64 
 7   ndvi_heterog_z    1592 non-null   float64 
dtypes: float64(4), geometry(2), int64(2)
memory usage: 111.9 KB
View Code
m = data.explore(
    column="ndvi_mean_z", 
    cmap="RdYlGn",
    marker_kwds={'radius':5},
    vmin=-2.5, 
    vmax=2.5, 
    legend=True,
    tooltip=["ndvi_mean_z"], # Hover to see raw vs z-score
    tiles="CartoDB positron",
    popup=True # Click to see all data for that route
)
m

This map displays the relative differences in standardised NDVI across the US. A visible challenge for our spatial model is the high density and overlap of routes in the Northeast. This concentration introduces sampling bias (over-representing certain environments) and pseudoreplication, where overlapping buffers sample nearly identical pixels, violating the assumption of independent observations.

To mitigate this, I implemented a spatial thinning procedure using the function thin_spatial_geopandas. This algorithm utilises a spatial index (sindex) to efficiently identify route buffers within a 1 m threshold and randomly selects one representative route from overlapping pairs to be retained, while discarding the rest. This ensures a more geographically balanced dataset for subsequent analysis.

View Code
def thin_spatial_geopandas(data, distance_threshold=1):
    # 1. Ensure we are working with a copy and shuffle for randomness
    gdf_temp = data.copy().sample(frac=1, random_state=42).reset_index(drop=True)
    
    keep_indices = []
    discard_indices = []
    processed = np.zeros(len(gdf_temp), dtype=bool)

    # 2. Build a spatial index
    sindex = gdf_temp.sindex

    for i in range(len(gdf_temp)):
        if processed[i]:
            continue
        
        # Keep this route
        keep_indices.append(i)
        processed[i] = True
        
        # Find all neighbors within the distance threshold
        current_geom = gdf_temp.geometry.iloc[i]
        # query returns indices of geometries whose bounding box intersects the search area
        # Then we filter by actual distance
        potential_neighbors = sindex.query(current_geom.buffer(distance_threshold))
        
        for neighbor_idx in potential_neighbors:
            if not processed[neighbor_idx] and neighbor_idx != i:
                # Double check actual distance (since query uses bounding boxes)
                if current_geom.distance(gdf_temp.geometry.iloc[neighbor_idx]) < distance_threshold:
                    discard_indices.append(neighbor_idx)
                    processed[neighbor_idx] = True
                    
    return gdf_temp.iloc[keep_indices], gdf_temp.iloc[discard_indices]

# Run the thinning (make sure your CRS is in meters, e.g., EPSG:5070)
df_thinned, df_removed = thin_spatial_geopandas(data, distance_threshold=1)

print(f"Routes kept: {len(df_thinned)}")
print(f"Routes removed: {len(df_removed)}")
Routes kept: 1442
Routes removed: 150
View Code
bbs_ndvi = df_thinned.drop(columns = "buff_3000") # We no longer need buffer geometry
bbs_ndvi = bbs_ndvi.set_geometry("route_midpoint")
bbs_ndvi.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 1442 entries, 0 to 1591
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   RTENO             1442 non-null   int64   
 1   species_richness  1442 non-null   int64   
 2   route_midpoint    1442 non-null   geometry
 3   lon               1442 non-null   float64 
 4   lat               1442 non-null   float64 
 5   ndvi_mean_z       1442 non-null   float64 
 6   ndvi_heterog_z    1442 non-null   float64 
dtypes: float64(4), geometry(1), int64(2)
memory usage: 90.1 KB

2.4 Elevation data

To get the elevation data, I integrated topographic data from the NASA Digital Elevation Model (NASADEM). Since the Breeding Bird Survey data do not inherently contain altitude information, I used the Google Earth Engine (GEE) API to perform cloud-based spatial join operations.

The extraction process involved several technical steps:

  • Feature Transformation: local coordinate pairs (latitude/longitude) for each survey route midpoint were converted into ee.Feature objects to make them compatible with the GEE spatial engine.
  • High-Resolution Sampling: I accessed the NASADEM Global 30 m dataset. Using a 30-metre sampling scale, the extraction matched the native resolution of the satellite sensor, ensuring that the elevation value assigned to each route accurately reflects the local terrain.
  • Data Integration: the extracted values were pulled from the cloud using the geemap library and merged back into the primary research DataFrame using the unique route identifier (RTENO).

View Code
# 1. Initialize Earth Engine Project ID
try:
    ee.Initialize(project="my-project-ee-470714")
except Exception as e:
    ee.Authenticate()
    ee.Initialize(project="my-project-ee-470714")

# 2. Convert DataFrame rows into GEE Features
features = []
for index, row in bbs_ndvi.iterrows():
    # GEE uses [Longitude, Latitude]
    point = ee.Geometry.Point([row["lon"], row["lat"]])
    feat = ee.Feature(point, {"RTENO": row["RTENO"]})
    features.append(feat)

points_ee = ee.FeatureCollection(features)

# 3. Load the NASADEM Global 30m dataset
nasadem = ee.Image("NASA/NASADEM_HGT/001").select("elevation")

# 4. Sample the image at road midpoints
sampled_points = nasadem.sampleRegions(
    collection=points_ee,
    properties=["RTENO"],
    scale=30
)

# 5. Convert back to a Pandas DataFrame
info = sampled_points.getInfo()
features = info['features']
dict_list = [f['properties'] for f in features]
df_elev = pd.DataFrame(dict_list)
# 6. Merge the elevation into final DataFrame
data_final = bbs_ndvi.merge(df_elev[["RTENO", "elevation"]], on="RTENO", how="left")

print("Elevation successfully extracted from GEE!")
Elevation successfully extracted from GEE!
View Code
data_final["elevation"].describe()
count    1442.000000
mean      592.497226
std       648.428474
min        -1.000000
25%       162.000000
50%       314.000000
75%       816.750000
max      3289.000000
Name: elevation, dtype: float64
View Code
# Save to a GeoPackage
data_final.to_file("assignment_data.gpkg", layer="bbs_routes", driver="GPKG")
View Code
data = gpd.read_file("assignment_data.gpkg")
data.head()
RTENO species_richness lon lat ndvi_mean_z ndvi_heterog_z elevation geometry
0 60130 26 -103.354305 32.340717 -2.006735 -0.573757 1073 POINT (-687991.066 1056389.17)
1 72016 65 -78.818844 41.692045 1.028855 -0.141942 508 POINT (1411937.291 2204439.401)
2 91036 55 -90.357021 44.651087 0.146810 0.078148 369 POINT (446389.92 2420135.718)
3 34068 47 -89.311522 41.466289 1.044674 -0.629305 221 POINT (553981.276 2070695.428)
4 91023 56 -89.648401 45.779301 0.873833 -0.629305 502 POINT (493986.861 2548870.791)

3 2. Spatial autocorrelation

With the data prepared, we first evaluate the extent of spatial dependency in bird species richness using a three-pronged diagnostic approach:
  • Moran plot - to see the global degree of clustering
  • Correlogram - to see how the spatial autocorrelation changes with larger neighbourhoods
  • LISA (Local Indicators of Spatial Association) - to see the distribution of significantly spatially autocorrelated values

View Code
# 1. Prepare standardized variables and spatial lags
knn5 = graph.Graph.build_knn(data, k=5)
knn5 = knn5.transform("R")
# Standardize richness (Z-score)
data["richness_std"] = (data["species_richness"] - data["species_richness"].mean()) / data["species_richness"].std()
# Calculate spatial lag using the KNN5 weights you already built
data["richness_lag"] = knn5.lag(data["richness_std"])
# Calculate Moran's I
moran = esda.moran.Moran(data['species_richness'], knn5)
# Calculate correlogram
k = [5, 10, 25, 50, 75, 100, 500, 1000]
correlogram = esda.correlogram(
  geometry=data.representative_point(),
  variable=data["species_richness"],
  support=k,
  distance_type="knn",
)

# 2. Calculate LISA (Local Indicators of Spatial Association)
# We use .to_W() because Moran_Local expects a weights object
lisa = esda.moran.Moran_Local(data["species_richness"], knn5.to_W())

# 3. Setup the 3-panel figure
f, ax = plt.subplots(1, 3, figsize=(24,7))

# --- PANEL 1: MORAN SCATTERPLOT ---
sns.regplot(
    x="richness_std",
    y="richness_lag",
    data=data,
    marker=".",
    scatter_kws={"alpha": 0.3, "color": "steelblue"},
    line_kws=dict(color="lightcoral"),
    ax=ax[0]
)
ax[0].set_aspect("equal")
ax[0].axvline(0, c="black", alpha=0.5, linewidth=1)
ax[0].axhline(0, c="black", alpha=0.5, linewidth=1)
ax[0].set_title(f"Moran Plot\n\nMoran's I: {moran.I:.3f} (p-value: {moran.p_sim:.3f})", fontsize=14)
ax[0].set_xlabel("Species Richness (Standardized)")
ax[0].set_ylabel("Spatial Lag (Standardized)")

# Add Quadrant Labels
ax[0].text(0.95, 0.95, "HH", transform=ax[0].transAxes, ha="right", va="top", fontweight="bold")
ax[0].text(0.95, 0.05, "HL", transform=ax[0].transAxes, ha="right", va="bottom", fontweight="bold")
ax[0].text(0.05, 0.95, "LH", transform=ax[0].transAxes, ha="left", va="top", fontweight="bold")
ax[0].text(0.05, 0.05, "LL", transform=ax[0].transAxes, ha="left", va="bottom", fontweight="bold")

# --- PANEL 2: LISA CLUSTER MAP ---
_ = lisa.plot(data, crit_value=0.05, ax=ax[1], alpha=0.6, markersize=20) 
ax[1].set_title("LISA Cluster Map (p < 0.05)", fontsize=14)
contextily.add_basemap(ax[1], crs=data.crs.to_string(), source=contextily.providers.CartoDB.PositronNoLabels)

# --- PANEL 3: CORRELOGRAM ---
# Plotting the I values against the support (K neighbors)
ax[2].plot(k, correlogram.I, marker="o", linestyle="-", color="steelblue")
ax[2].axhline(0, color="black", linestyle="--", alpha=0.5)
ax[2].set_title("Spatial Correlogram", fontsize=14)
ax[2].set_xlabel("Number of Nearest Neighbors (K)")
ax[2].set_ylabel("Moran's I")
ax[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Figure 1: Moran plot, LISA cluster map (p < 0.05), and spatial correlogram of bird species richness across the study area.

The Global Moran’s I value confirms a strong, positive spatial autocorrelation, indicating that the species richness is geographically clustered. The LISA map reveals a distinct East-West dichotomy: the Eastern US is dominated by biodiversity hotspots (High-High), likely driven by higher precipitation and vegetation density, while the arid West exhibits extensive coldspots (Low-Low).

An interesting anomaly is the coldspot cluster in Florida: since Florida is ecologically rich in wetlands, I would not expect a cluster of low species richness values there.

Finally, the spatial correlogram demonstrates that spatial influence is strongest at small scales but drops significantly until the neighbourhood size reaches approximately 100 neighbours. This inflexion point suggests that the ecological processes driving bird richness operate within a specific regional range.

4 3. Linear regression

4.1 OLS model

Now we can continue with linear regression. We already know that there is high spatial autocorrelation, so our ultimate goal is geographically weighted regression (GWR), but first we will build a simple OLS model and inspect its residuals. However, this model will already contain two spatial predictors: longitude and latitude.

View Code
# Define our dependent variable (target) and independent variables (predictors)
dependent = "species_richness"
independents = ["ndvi_mean_z", "ndvi_heterog_z", "elevation", "lon", "lat"]

# Construct the formula string from the predictor list
formula = f"{dependent} ~ {' + '.join(independents)}"
print(f"OLS Formula: {formula}")
OLS Formula: species_richness ~ ndvi_mean_z + ndvi_heterog_z + elevation + lon + lat
View Code
# Fit the Ordinary Least Squares (OLS) model
ols = smf.ols(formula, data=data).fit()

ols.summary()
OLS Regression Results
Dep. Variable: species_richness R-squared: 0.329
Model: OLS Adj. R-squared: 0.327
Method: Least Squares F-statistic: 141.1
Date: Fri, 13 Feb 2026 Prob (F-statistic): 6.76e-122
Time: 14:09:35 Log-Likelihood: -5542.3
No. Observations: 1442 AIC: 1.110e+04
Df Residuals: 1436 BIC: 1.113e+04
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 51.1524 3.096 16.521 0.000 45.079 57.226
ndvi_mean_z 5.8241 0.449 12.976 0.000 4.944 6.705
ndvi_heterog_z 1.3671 0.313 4.368 0.000 0.753 1.981
elevation -0.0015 0.001 -2.415 0.016 -0.003 -0.000
lon 0.1071 0.029 3.693 0.000 0.050 0.164
lat 0.3419 0.066 5.201 0.000 0.213 0.471
Omnibus: 10.586 Durbin-Watson: 1.967
Prob(Omnibus): 0.005 Jarque-Bera (JB): 11.957
Skew: 0.141 Prob(JB): 0.00253
Kurtosis: 3.346 Cond. No. 9.16e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The global OLS model indicates that primary productivity (ndvi_mean_z) is the dominant driver of bird species richness across the contiguous US, and the model explains about one third of the variance (R² = 0.329). Bird diversity slightly decreases with higher elevation and increases with vegetation heterogeneity. Longitude and latitude have very small explanatory power. Furthermore, the OLS model assumes these relationships are constant across the continent. To investigate whether these drivers vary geographically (for instance, whether NDVI matters more in the arid West than in the humid East), I will now proceed to Geographically Weighted Regression (GWR). But first, let us take a look at the OLS model from a spatial perspective.
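Before fitting the library model, the mechanics of GWR can be sketched in plain numpy for a single focal location. Everything below is synthetic toy data, and the kernel is a simplified adaptive bisquare, not mgwr's exact implementation: weight observations by distance to the focal point, then solve weighted least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
coords = rng.uniform(0, 100, size=(n, 2))              # synthetic point locations
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
y = 10 + 2 * X[:, 1] + rng.normal(size=n)              # true slope = 2 everywhere

focal = coords[0]                                       # fit at the first location
d = np.linalg.norm(coords - focal, axis=1)
bw = np.sort(d)[19]                  # adaptive bandwidth: 20th nearest point (incl. focal)
w = np.where(d <= bw, (1 - (d / bw) ** 2) ** 2, 0.0)    # bisquare kernel weights

# Weighted least squares: beta = (X'WX)^(-1) X'Wy
W = np.diag(w)
beta_local = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_local)  # local [intercept, slope] estimate near the focal point
```

Repeating this fit at every observation, each with its own kernel, yields the surface of local coefficients that GWR reports; mgwr additionally optimises the bandwidth (as Sel_BW does below) and computes the model diagnostics.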

View Code
predicted = ols.predict(data)
data["residual"] = ols.resid
max_residual = ols.resid.abs().max()

f, axs = plt.subplots(1, 3, figsize=(24, 7))
data.plot(
    predicted, legend=True, cmap="coolwarm", vmin=0, vmax=100, ax=axs[0], alpha = 0.8, markersize=30
)
data.plot(
    "species_richness", legend=True, cmap="coolwarm", vmin=0, vmax=100, ax=axs[1], alpha = 0.8, markersize=30
)

data.plot(
    "residual", legend=True, cmap="RdBu", vmin=-max_residual, vmax=max_residual, ax=axs[2], alpha = 0.8, markersize=30
)

res_moran = esda.moran.Moran(data["residual"], knn5)


axs[0].set_title("OLS prediction", fontsize=14)
axs[1].set_title("Actual results", fontsize=14)
axs[2].set_title(f"Residuals\n\nMoran's I: {res_moran.I:.3f} (p-value: {res_moran.p_sim:.3f})", fontsize=14)

axs[0].set_axis_off()
axs[1].set_axis_off()
axs[2].set_axis_off()
plt.tight_layout()

In the first map we can observe that the model predicts the expected difference in species richness between the East and West of the US. However, it both overestimates and underestimates richness in some areas, because the actual data (middle map) show a more varied pattern. The map on the right presents the residuals, i.e. locations where the model fails to explain the dependent variable. The blue areas have higher species richness than expected by the model, while the red areas show the opposite (including Florida, which has lower species diversity than expected). Moran’s I indicates that the residuals form spatial clusters.

4.2 Geographically weighted regression

View Code
# 1. Define Dependent (y) and Independent (X) variables
# We reshape y to (-1, 1) to ensure it is a column vector for matrix math
y = data["species_richness"].values.reshape((-1, 1))

# We select our environmental predictors
# We exclude lon/lat from X because they are used in "coords" instead
X = data[["ndvi_mean_z", "ndvi_heterog_z", "elevation"]].values

# 2. Define the spatial coordinates
coords = data.centroid.get_coordinates().values
View Code
# Initialize the bandwidth selector
bw_selector = Sel_BW(coords, y, X, fixed=False, multi=False)

# Search for the optimal bandwidth (number of nearest neighbors)
opt_bw = bw_selector.search()

print(f"Optimal bandwidth (neighbors): {opt_bw}")
Optimal bandwidth (neighbors): 80.0
View Code
# Fit the model
gwr_model = GWR(coords, y, X, opt_bw).fit()

# View the global summary of the local models
gwr_model.summary()
===========================================================================
Model type                                                         Gaussian
Number of observations:                                                1442
Number of covariates:                                                     4

Global Regression Results
---------------------------------------------------------------------------
Residual sum of squares:                                         188213.517
Log-likelihood:                                                   -5558.494
AIC:                                                              11124.988
AICc:                                                             11127.030
BIC:                                                             177753.813
R2:                                                                   0.314
Adj. R2:                                                              0.313

Variable                              Est.         SE  t(Est/SE)    p-value
------------------------------- ---------- ---------- ---------- ----------
X0                                  54.352      0.458    118.587      0.000
X1                                   7.121      0.383     18.574      0.000
X2                                   1.381      0.313      4.420      0.000
X3                                  -0.001      0.001     -2.067      0.039

Geographically Weighted Regression (GWR) Results
---------------------------------------------------------------------------
Spatial kernel:                                           Adaptive bisquare
Bandwidth used:                                                      80.000

Diagnostic information
---------------------------------------------------------------------------
Residual sum of squares:                                         122138.628
Effective number of parameters (trace(S)):                          160.282
Degree of freedom (n - trace(S)):                                  1281.718
Sigma estimate:                                                       9.762
Log-likelihood:                                                   -5246.719
AIC:                                                              10816.001
AICc:                                                             10856.906
BIC:                                                              11666.566
R2:                                                                   0.555
Adjusted R2:                                                          0.499
Adj. alpha (95%):                                                     0.001
Adj. critical t value (95%):                                          3.234

Summary Statistics For GWR Parameter Estimates
---------------------------------------------------------------------------
Variable                   Mean        STD        Min     Median        Max
-------------------- ---------- ---------- ---------- ---------- ----------
X0                       55.634      9.885     19.849     55.695     85.198
X1                        4.078      5.228    -14.156      4.573     16.293
X2                        1.041      2.064     -6.011      1.245      6.923
X3                        0.002      0.037     -0.083     -0.002      0.200
===========================================================================

The GWR model increased the explained variability to 55%, and we can now compare it visually with the OLS model and the actual data.
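The jump in R² can be sanity-checked directly from the residual sums of squares reported in the summary (values copied from the output above):

```python
# Residual sums of squares from the global (OLS) and GWR fits above
rss_ols = 188213.517
rss_gwr = 122138.628
r2_ols = 0.314

# Back out the total sum of squares from the OLS R2, then recompute GWR R2
tss = rss_ols / (1 - r2_ols)
r2_gwr = 1 - rss_gwr / tss
print(f"GWR R2: {r2_gwr:.3f}")  # matches the 0.555 reported in the summary
```

Note that part of this gain is bought with roughly 160 effective parameters instead of 4, which is why the AICc comparison (10856.9 vs 11127.0) is the fairer yardstick, and it too favours GWR.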

# Create a copy to store results
data_results = data.copy()

# Add Local R2 (Where does the model perform best?)
data_results['gwr_R2'] = gwr_model.localR2

# Add local coefficients
# mgwr fits an intercept by default, so params column 0 is the intercept and
# the predictors follow in the order of X: [ndvi_mean_z, ndvi_heterog_z, elevation]
data_results['coef_ndvi'] = gwr_model.params[:, 1]
data_results['coef_hetero'] = gwr_model.params[:, 2]
data_results['coef_elev'] = gwr_model.params[:, 3]

# Add t-values (to check local significance)
# A |t| > 1.96 means significance at the conventional 95% level
data_results['t_ndvi'] = gwr_model.tvalues[:, 1]
# 1. Define a shared color scale so the three maps are directly comparable
v_min = 0
v_max = 100

fig, axs = plt.subplots(1, 3, figsize=(24, 7))

# --- Plot 1: OLS Prediction ---
# We use the global OLS model to predict based on the dataframe
data.plot(
    ols.predict(data), 
    legend=True, cmap="coolwarm", vmin=v_min, vmax=v_max, ax=axs[0], markersize=40
)
axs[0].set_title("OLS Prediction (Global Model)", fontsize=14)

# --- Plot 2: GWR Prediction ---
# 'predy' contains the local predictions from the GWR model
data.plot(
    gwr_model.predy.flatten(), 
    legend=True, cmap="coolwarm", vmin=v_min, vmax=v_max, ax=axs[1], markersize=40
)
axs[1].set_title("GWR Prediction (Local Model)", fontsize=14)

# --- Plot 3: Actual Data ---
# The ground truth: the observed 'species_richness' column
data.plot(
    "species_richness", 
    legend=True, cmap="coolwarm", vmin=v_min, vmax=v_max, ax=axs[2], markersize=40
)
axs[2].set_title("Actual Species Richness", fontsize=14)

# Cleanup
for ax in axs:
    ax.set_axis_off()

plt.tight_layout()
plt.show()

As the maps show, the GWR model captures the variability in the actual data better, further highlighting the contrast between East and West; however, it still fails to predict very high or very low values in some areas. To examine these areas specifically, we can inspect the local beta coefficients.

# 1. Define the significance threshold, corrected for multiple testing
sig95 = gwr_model.adj_alpha[1]
critical_t = gwr_model.critical_tval(alpha=sig95)
significant = np.abs(gwr_model.tvalues) > critical_t

# 2. Get the actual number of variables from the model
num_vars = gwr_model.params.shape[1]

# 3. Build variable names to match the params columns
# mgwr prepends an intercept, so the columns are the intercept followed by
# the predictors in the order of X: [ndvi_mean_z, ndvi_heterog_z, elevation]
actual_predictor_names = ["ndvi_mean_z", "ndvi_heterog_z", "elevation"]
var_names = ["Intercept"] + actual_predictor_names

# 4. Create the grid
cols = 2
rows = (num_vars + cols - 1) // cols 

fig, axs = plt.subplots(rows, cols, figsize=(12, rows * 5))
axs = axs.flatten()

for i in range(num_vars):
    # Plot the local coefficients
    data.plot(
        gwr_model.params[:, i], 
        cmap="coolwarm", 
        ax=axs[i], 
        markersize=15, # Increased size for better visibility
        legend=True,
        legend_kwds={"shrink": 0.5}
    )
    
    # Significance Mask
    non_sig_mask = ~significant[:, i]
    if non_sig_mask.any():
        data[non_sig_mask].plot(
            color="white", 
            ax=axs[i], 
            alpha=0.7, 
            markersize=15
        )
    
    # Check if we have a name for this index to avoid IndexError
    title_name = var_names[i] if i < len(var_names) else f"Variable {i}"
    axs[i].set_title(f"Local Coef: {title_name}", fontsize=12)
    axs[i].set_axis_off()

# Hide any extra axes
for j in range(i + 1, len(axs)):
    axs[j].set_axis_off()

plt.tight_layout()
plt.show()
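The adjusted alpha used for the significance mask guards against multiple testing across the many local models; to my understanding, mgwr follows the da Silva and Fotheringham correction, which scales the nominal level by the ratio of the number of covariates to the effective number of parameters. With the values from the summary:

```python
# Values taken from the GWR summary above
alpha = 0.05      # nominal significance level
k = 4             # number of covariates (intercept + 3 predictors)
enp = 160.282     # effective number of parameters, trace(S)

# Corrected alpha: the nominal level is spread over enp/k effective tests
adj_alpha = alpha * k / enp
print(f"Adjusted alpha: {adj_alpha:.4f}")  # ~0.0012, matching the summary
```

The corresponding two-tailed t critical value at this alpha (about 3.23 at the model's degrees of freedom) is what the white significance mask uses instead of the conventional 1.96, so the coloured areas on the maps are significant after this stricter correction.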

The intercept coefficient shows the baseline species richness not explained by the predictors. We can see clusters of species-poor routes in the rain shadow of the Rocky Mountains and in the Great Basin, while biodiversity hotspots are situated mostly around the Great Lakes. These patterns, not captured by the model, could be explained by a combination of factors such as regional history, the presence of protected areas, and human influence. Relative NDVI significantly drives bird diversity in the arid West, especially in desert habitats, where even a slight increase in vegetation can support a markedly more diverse bird community. Vegetation heterogeneity is significant only in the Great Basin, and its effect is not very strong. Finally, elevation appears to explain species richness in Florida, but this is very probably spurious: Florida has very low elevational heterogeneity and differs from the other states mainly in its average elevation, which is surely not the driver of bird diversity in this area. The generally lower species diversity of the Florida Peninsula is likely driven by factors not included in the model, perhaps high human population density or climatic instability.

5 Conclusion

This study confirms that bird species richness is driven by spatially varying processes. Global Moran’s I and LISA maps reveal a sharp East-West dichotomy, with the East acting as a biodiversity hotspot and the West as a coldspot. While OLS models capture broad trends, GWR significantly improves accuracy by accounting for local “non-stationarity.”

Local coefficients show that NDVI is a vital driver in the arid West, where even marginal increases in greenness boost diversity. Despite the temporally limited data and the use of NDVI as a proxy for productivity, the model recovered the expected SR~NDVI relationship; the analysis should nonetheless be treated as exploratory, and the significance of its results as limited.

5.1 References

Brüggeshemke, Jonas, and Thomas Fartmann. 2025. “Predicting Species Richness and Abundance of Breeding Birds by Remote-Sensing and Field-Survey Data.” Ecological Solutions and Evidence 6 (4): e70170. https://doi.org/10.1002/2688-8319.70170.
Byer, Nathan W., Remington J. Moll, Timothy J. Krynak, et al. 2025. “Breeding Bird Sensitivity to Urban Habitat Quality Is Multi-Scale and Strongly Dependent on Migratory Behavior.” Ecological Applications 35 (1): e3087. https://doi.org/10.1002/eap.3087.
Di Cecco, Grace J., Sara J. Snell Taylor, Ethan P. White, and Allen H. Hurlbert. 2022. “More Individuals or Specialized Niches? Distinguishing Support for Hypotheses Explaining Positive Species–Energy Relationships.” Journal of Biogeography 49 (9): 1629–39. https://doi.org/10.1111/jbi.14459.
Hawkins, Bradford A., Eric E. Porter, and José Alexandre Felizola Diniz-Filho. 2003. “Productivity and History as Predictors of the Latitudinal Diversity Gradient of Terrestrial Birds.” Ecology 84 (6): 1608–23. https://doi.org/10.1890/0012-9658(2003)084[1608:PAHAPO]2.0.CO;2.
Mittelbach, Gary G., Christopher F. Steiner, Samuel M. Scheiner, et al. 2001. “What Is the Observed Relationship Between Species Richness and Productivity?” Ecology 82 (9): 2381–96. https://doi.org/10.1890/0012-9658(2001)082[2381:WITORB]2.0.CO;2.
Nieto, Sebastián, Pedro Flombaum, and Martín F. Garbulsky. 2015. “Can Temporal and Spatial NDVI Predict Regional Bird-Species Richness?” Global Ecology and Conservation 3: 729–35. https://doi.org/10.1016/j.gecco.2015.03.005.
Pigot, Alexander L., Joseph A. Tobias, and Walter Jetz. 2016. “Energetic Constraints on Species Coexistence in Birds.” PLOS Biology 14 (3): e1002407. https://doi.org/10.1371/journal.pbio.1002407.

5.1.1 Data Sources